2 Quantization of Neural Networks

Quantization is a strategy that has demonstrated outstanding and consistent success in both the training and inference of neural networks (NNs). Even though the issues of numerical representation and quantization are as old as digital computing, NNs present unique opportunities for advancement. Although most of this quantization survey is concerned with inference, it is essential to note that quantization has also been successful in NN training [8, 42, 63, 105]. In particular, innovations in half-precision and mixed-precision training [47, 80] have enabled greater throughput on AI accelerators. However, going below half-precision without significant tuning has proven challenging, and most recent quantization research has concentrated on inference.

2.1 Overview of Quantization

Given an NN model of $N$ layers, we denote its weight set as $\mathbf{W} = \{\mathbf{w}^n\}_{n=1}^{N}$ and the input feature set as $\mathbf{A} = \{\mathbf{a}_{in}^n\}_{n=1}^{N}$. Here, $\mathbf{w}^n \in \mathbb{R}^{C_{out}^n \times C_{in}^n}$ and $\mathbf{a}_{in}^n \in \mathbb{R}^{C_{in}^n}$ are the convolutional weight and the input feature map of the $n$-th layer, respectively, where $C_{in}^n$ and $C_{out}^n$ respectively stand for the number of input channels and the number of output channels. Then, the outputs $\mathbf{a}_{out}^n$ can be formulated as:

$$\mathbf{a}_{out}^n = \mathbf{w}^n \cdot \mathbf{a}_{in}^n, \tag{2.1}$$

where $\cdot$ represents matrix multiplication. In this paper, we omit the non-linear function for simplicity. Following prior works [100], a quantized neural network (QNN) intends to represent $\mathbf{w}^n$ and $\mathbf{a}^n$ in a low-bit format as

$$\mathbb{Q} := \{q_1, \cdots, q_U\},$$

where the $q_i$, $i = 1, \cdots, U$, satisfying $q_1 < \cdots < q_U$, are defined as the quantized values of the original variable $x$. Note that $x$ can be the input feature $\mathbf{a}^n$ or the weights $\mathbf{w}^n$. In this way, $\mathbf{q}_{\mathbf{w}^n} \in \mathbb{Q}^{C_{out}^n \times C_{in}^n}$ and $\mathbf{q}_{\mathbf{a}_{in}^n} \in \mathbb{Q}^{C_{in}^n}$, such that the floating-point convolutional outputs can be approximated by the efficient XNOR and bit-count instructions as:

$$\mathbf{a}_{out}^n \approx \mathbf{q}_{\mathbf{w}^n} \odot \mathbf{q}_{\mathbf{a}_{in}^n}. \tag{2.2}$$
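To make Eq. (2.2) concrete, below is a minimal NumPy sketch of the extreme binary case $\mathbb{Q} = \{-1, +1\}$: quantized weights and activations are stored as 0/1 bits, and their dot product is recovered from XNOR and bit-count alone. The function names and the 64-element toy vectors are our own illustration; production kernels pack the bits into machine words and use hardware popcount instructions.

```python
import numpy as np

def binarize(x):
    # Sign quantizer: maps each real value onto Q = {-1, +1}.
    return np.where(x >= 0, 1, -1).astype(np.int8)

def xnor_dot(bits_w, bits_a):
    # Dot product of two {-1, +1} vectors stored as 0/1 bits:
    # matches - mismatches = 2 * popcount(XNOR) - n.
    n = bits_w.size
    matches = int((~(bits_w ^ bits_a) & 1).sum())  # popcount of XNOR
    return 2 * matches - n

rng = np.random.default_rng(0)
w = rng.standard_normal(64)   # one row of w^n (toy example)
a = rng.standard_normal(64)   # the input feature a^n_in

qw, qa = binarize(w), binarize(a)
bits_w = ((qw + 1) // 2).astype(np.uint8)  # map -1/+1 to 0/1
bits_a = ((qa + 1) // 2).astype(np.uint8)

# The XNOR/bit-count result equals the integer dot product of qw and qa.
assert xnor_dot(bits_w, bits_a) == int(qw.astype(np.int32) @ qa.astype(np.int32))
```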

The core question of QNNs is how to define the quantization set $\mathbb{Q}$, which is described next.

2.1.1 Uniform and Non-Uniform Quantization

First, we must define a function capable of quantizing the weights and activations of the NN to a finite set of values. The following is a popular choice for a quantization function:

$$Q(x) = \mathrm{INT}\left(\frac{x}{S}\right) - Z, \tag{2.3}$$
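As an illustration of Eq. (2.3), the sketch below implements this uniform quantizer in NumPy, reading $S$ as a real-valued scaling factor, $Z$ as an integer zero point, and $\mathrm{INT}(\cdot)$ as round-to-nearest; the clipping range and the example values are assumptions of this sketch rather than part of the equation.

```python
import numpy as np

def uniform_quantize(x, scale, zero_point, num_bits=8):
    # Eq. (2.3): Q(x) = INT(x / S) - Z, with INT(.) as round-to-nearest.
    q = np.round(x / scale) - zero_point
    # Clip to the signed num_bits integer grid (an assumption of this sketch).
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    return np.clip(q, qmin, qmax).astype(np.int32)

def dequantize(q, scale, zero_point):
    # Approximate recovery of the real value: x_hat = S * (Q(x) + Z).
    return scale * (q.astype(np.float32) + zero_point)

x = np.array([-1.0, -0.25, 0.0, 0.4, 1.0], dtype=np.float32)
scale, zero_point = 1.0 / 127, 0   # symmetric 8-bit example (assumed)
q = uniform_quantize(x, scale, zero_point)
x_hat = dequantize(q, scale, zero_point)  # error |x - x_hat| <= S/2 within range
```

Because the representable values $S \cdot (q + Z)$ are evenly spaced, this scheme is referred to as uniform quantization; non-uniform schemes instead place the $q_i$ on an unequally spaced grid.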

